[#2039] Support Default Value Semantis: use default values reading AVRO files#72
Conversation
|
@wmoustafa @funcheetah @HotSushi please take a look |
core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java
Outdated
Show resolved
Hide resolved
| Schema.Field newField = new Schema.Field(newFieldName, newFiledSchema, null, defaultValue); | ||
| newField.addProp(AvroSchemaUtil.FIELD_ID_PROP, expectedField.fieldId()); |
There was a problem hiding this comment.
This branch seems to have common code with a the out if branch. Do you see room to combine and simplify?
There was a problem hiding this comment.
we have 4 cases (for 2 conditions), for three of them we create a new field and add to updatedFileds, but different fields. The only thing that can be done is to create a shallow 3-lines function with 5 argument. I prefer to keep as is
| Schema newSchemaReordered; | ||
| // if the newSchema is an optional schema, make sure the NULL option is always the first | ||
| if (isOptionSchemaWithNonNullFirstOption(newSchema)) { | ||
| // if the newSchema is an optional schema with no, or null, default value, then make sure the |
There was a problem hiding this comment.
How about if the newSchema is an optional schema or has a default value that is null, then make sure the NULL option is the first.
There was a problem hiding this comment.
(optional_no_default || optional_null_default) != (optional || has_default)
| if (isOptionSchemaWithNonNullFirstOption(newSchema)) { | ||
| // if the newSchema is an optional schema with no, or null, default value, then make sure the | ||
| // NULL option is the first | ||
| boolean hasNonNullDefaultValue = field.hasDefaultValue() && AvroSchemaUtil.isNonNullDefault(field.defaultVal()); |
There was a problem hiding this comment.
Refactoring as recommended above may help here.
| Object defaultValue = field.hasDefaultValue() && !(field.defaultVal() instanceof JsonProperties.Null) ? | ||
| field.defaultVal() : null; | ||
| Object defaultValue = | ||
| field.hasDefaultValue() && AvroSchemaUtil.isNonNullDefault(field.defaultVal()) ? field.defaultVal() : null; |
There was a problem hiding this comment.
What is the value of calling isNonNullDefault here? It just another way to do the same check as before !(field.defaultVal() instanceof JsonProperties.Null)?
core/src/test/java/org/apache/iceberg/TestSchemaParserForDefaultValues.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestSchemaParserForDefaultValues.java
Show resolved
Hide resolved
|
|
||
| private static String toJsonString(Object value) { | ||
| if (isPrimitiveClass(value)) { | ||
| return value.toString(); |
There was a problem hiding this comment.
For FIXED and BINARY, should do something like Base64.getEncoder().encodeToString(bytes)? (also the opposite transformation on the deserialization side).
There was a problem hiding this comment.
wrapped into ByteBuffer
funcheetah
left a comment
There was a problem hiding this comment.
Thanks @shenodaguirguis ! This PR looks good to me in general and leave a few minor comments. One question I have: is default value for complex union supported?
core/src/test/java/org/apache/iceberg/avro/TestAvroOptionsWithNonNullDefaults.java
Outdated
Show resolved
Hide resolved
c5de40a to
1a15a81
Compare
funcheetah
left a comment
There was a problem hiding this comment.
Thanks a lot @shenodaguirguis for this feature!
Last change of 3 (see description: apache/iceberg#2496 (comment))
The main changes here are in BuildAvroProjection.java and PruneColumns.java which makes sure that default value is copied over and used while reading projected columns with default values.
Other changes are utils, and testing changes.
There will be a followup PR to handle default values in serialization/deserializaiton, but this can go on parallel with ORC and Parquet changes.
[Update: during integration testing, I had to implement the default value serialization/deserialization. It works now on spark shell to select missing columns with default values and return the default value:
Also, during integration testing, it turns out that Avro default values for records are Maps, not Lists, so had to change that in the NestedField class.